Nonlinear Models (ISL 7)

Biostat 212A

Author

Dr. Jin Zhou @ UCLA

Published

May 28, 2024

Credit: This note heavily uses material from the books An Introduction to Statistical Learning: with Applications in R (ISL2) and Elements of Statistical Learning: Data Mining, Inference, and Prediction (ESL2).

Display system information for reproducibility.

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.3    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.3       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        xfun_0.42         digest_0.6.34    
[13] jsonlite_1.8.8    rlang_1.1.3       evaluate_0.23    
import IPython
print(IPython.sys_info())
{'commit_hash': '49914f938',
 'commit_source': 'installation',
 'default_encoding': 'utf-8',
 'ipython_path': '/Users/jzhou/.virtualenvs/r-reticulate/lib/python3.9/site-packages/IPython',
 'ipython_version': '8.18.1',
 'os_name': 'posix',
 'platform': 'macOS-14.5-arm64-arm-64bit',
 'sys_executable': '/Users/jzhou/.virtualenvs/r-reticulate/bin/python',
 'sys_platform': 'darwin',
 'sys_version': '3.9.17 (main, Jul 26 2023, 20:27:39) \n'
                '[Clang 14.0.3 (clang-1403.0.22.14.1)]'}

1 Overview

  • The truth is never linear! Or almost never!

    But often the linearity assumption is good enough.

  • When it’s not …

    • polynomials
    • step functions
    • splines
    • local regression, and
    • generalized additive models

    offer a lot of flexibility, without losing the ease and interpretability of linear models.

  • Main idea

    • To augment/replace the vector of inputs \(X\) with additional variables, which are transformations of \(X\), and then use linear models in this new space of derived input features.
    • For example, in regression problems, \(f(X) = \mbox{E}(Y\mid X)\) is modeled as a linear function of \(X\); here we instead model it as a linear function of transformations \(h_m(X)\) of \(X\):

\[ f(X) =\sum_{m=1}^M \beta_m h_m(X) \]

  • The beauty of this approach is that once the basis functions \(h_m\) have been determined, the models are linear in these new variables, and the fitting proceeds as for linear models.
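As a minimal sketch of this idea (simulated data and numpy only; all names here are ours, not from the lecture), build the derived features \(h_m(X)\) and fit by ordinary least squares:

```python
import numpy as np

# Hypothetical data generated from a quadratic truth
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale = 0.1, size = x.size)

# Basis functions h_1(x) = x, h_2(x) = x^2, plus a constant column
H = np.column_stack([np.ones_like(x), x, x**2])

# Fitting is ordinary least squares in the derived features
beta, *_ = np.linalg.lstsq(H, y, rcond = None)
print(beta)  # close to (1, 2, -0.5)
```

Everything downstream (standard errors, prediction) then proceeds exactly as in ordinary linear regression on the new variables.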

2 Polynomial regression

For most of this lecture, we consider a single predictor (\(p = 1\)), as in \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i. \]

# Plot wage ~ age, display order-4 polynomial fit
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point() + 
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 4)
    ) +
  labs(
    title = "Degree-4 Polynomial",
    x = "Age",
    y = "Wage (k$)"
    )

# Visualize wage ~ age, display order-4 polynomial fit
plt.figure()
sns.lmplot(
  data = Wage, 
  x = "age", 
  y = "wage", 
  # Order-4 polynomial regression
  order = 4,
  scatter_kws = {'alpha' : 0.1},
  height = 8
  ).set(
  xlabel = 'Age', 
  ylabel = 'Wage (k$)',
  title = 'Degree-4 Polynomial'
  );
plt.show()
  • Create new variables \(X_1 = X\), \(X_2 = X^2\), …, and then treat as multiple linear regression.

  • Not really interested in the coefficients; more interested in the fitted function values at any value \(x_0\): \[ \hat f(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0 + \hat{\beta}_2 x_0^2 + \hat{\beta}_3 x_0^3 + \hat{\beta}_4 x_0^4. \]

# poly(age, 4) constructs orthogonal polynomials of degree 1 through 4, all orthogonal to the constant
lmod <- lm(wage ~ poly(age, degree = 4), data = Wage)
summary(lmod)

Call:
lm(formula = wage ~ poly(age, degree = 4), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.707 -24.626  -4.993  15.217 203.693 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             111.7036     0.7287 153.283  < 2e-16 ***
poly(age, degree = 4)1  447.0679    39.9148  11.201  < 2e-16 ***
poly(age, degree = 4)2 -478.3158    39.9148 -11.983  < 2e-16 ***
poly(age, degree = 4)3  125.5217    39.9148   3.145  0.00168 ** 
poly(age, degree = 4)4  -77.9112    39.9148  -1.952  0.05104 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.91 on 2995 degrees of freedom
Multiple R-squared:  0.08626,   Adjusted R-squared:  0.08504 
F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
# poly(age, 4, raw = TRUE) produces the raw polynomials age, age^2, age^3, age^4, which match Python
lmod <- lm(wage ~ poly(age, degree = 4, raw = TRUE), data = Wage)
summary(lmod)

Call:
lm(formula = wage ~ poly(age, degree = 4, raw = TRUE), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.707 -24.626  -4.993  15.217 203.693 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)                        -1.842e+02  6.004e+01  -3.067 0.002180 ** 
poly(age, degree = 4, raw = TRUE)1  2.125e+01  5.887e+00   3.609 0.000312 ***
poly(age, degree = 4, raw = TRUE)2 -5.639e-01  2.061e-01  -2.736 0.006261 ** 
poly(age, degree = 4, raw = TRUE)3  6.811e-03  3.066e-03   2.221 0.026398 *  
poly(age, degree = 4, raw = TRUE)4 -3.204e-05  1.641e-05  -1.952 0.051039 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.91 on 2995 degrees of freedom
Multiple R-squared:  0.08626,   Adjusted R-squared:  0.08504 
F-statistic: 70.69 on 4 and 2995 DF,  p-value: < 2.2e-16
  • Since \(\hat f(x_0)\) is a linear function of the \(\hat{\beta}_j\), we can get a simple expression for the pointwise variance \(\operatorname{Var}[\hat f(x_0)]\) at any value \(x_0\).

  • We either fix the degree \(d\) at some reasonably low value, or use cross-validation to choose \(d\).

  • Can do separately on several variables. Just stack the variables into one matrix, and separate out the pieces afterwards (see GAMs later).

  • Polynomial modeling can be done for generalized linear models (logistic regression, Poisson regression, etc) as well.

  • Caveat: polynomials have notorious tail behavior. Very bad for extrapolation.
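The pointwise variance above follows from \(\hat f(x_0) = x_0^T \hat{\beta}\), so \(\operatorname{Var}[\hat f(x_0)] = x_0^T \operatorname{Cov}(\hat{\beta}) x_0\). A sketch with simulated data (raw degree-4 polynomial basis, numpy only; the lecture's example uses Wage):

```python
import numpy as np

# Simulated data (hypothetical stand-in for Wage)
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 3, n)
y = 1.0 + x + rng.normal(scale = 0.3, size = n)

# Degree-4 raw polynomial design matrix: columns 1, x, ..., x^4
X = np.vander(x, 5, increasing = True)
beta = np.linalg.lstsq(X, y, rcond = None)[0]

# Estimated Cov(beta_hat) = sigma^2 (X'X)^{-1}
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)

# Pointwise standard error of f_hat(x0): sqrt(x0' Cov x0)
x0 = np.vander([1.5], 5, increasing = True)[0]
se = float(np.sqrt(x0 @ cov @ x0))
print(se)
```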

Code
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Polynomial regression with degree 14
  geom_smooth(
    method = "lm",
    formula = y ~ poly(x, degree = 14),
    color = "blue"
    ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, df = 14),
    color = "red"
    ) +  
  labs(
    title = "Natural cubic spline (red) vs polynomial regression (blue)",
    subtitle = "Both have df=15",
    x = "Age",
    y = "Wage (k$)"
    )

from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features of age predictor
poly_tf = make_column_transformer(
  (PolynomialFeatures(degree = 4, include_bias = False), ['age']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("poly_tf", poly_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)
# R^2
pipe.score(X, y)
# Plot
plt.figure()
ax = sns.scatterplot(
  data = Wage,
  x = 'age',
  y = 'wage',
  alpha = 0.1
);
sns.lineplot(
  x = Wage['age'],
  y = pipe.predict(X),
  ax = ax
).set(
  title = "Polynomial regression (order = 4)",
  xlabel = 'Age', 
  ylabel = 'Wage (k$)'
);
plt.show()
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Fit linear regression
lmod = smf.ols(formula = 'wage ~ np.vander(age, 5, increasing = True) - 1', data = Wage).fit()
lmod.summary()
np.polyfit(Wage.age, Wage.wage, deg = 4)

3 Piecewise polynomials (regression splines)

  • Instead of a single polynomial in \(X\) over its whole domain, we can rather use different polynomials in regions defined by knots. E.g., a piecewise cubic polynomial with a single knot at \(c\) takes the form \[ y_i = \begin{cases} \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \beta_{31} x_i^3 + \epsilon_i & \text{if } x_i < c \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \beta_{32} x_i^3 + \epsilon_i & \text{if } x_i \ge c \end{cases}. \]

  • Better to add constraints to the polynomials, e.g., continuity.

  • Splines have the “maximum” amount of continuity.

3.1 Linear spline

  • A linear spline with knots at \(\xi_k\), \(k = 1,\ldots,K\), is a piecewise linear polynomial continuous at each knot.

  • We can represent this model as \[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+1} b_{K+1}(x_i) + \epsilon_i, \] where \(b_k\) are basis functions:
    \[\begin{eqnarray*} b_1(x_i) &=& x_i \\ b_{k+1}(x_i) &=& (x_i - \xi_k)_+, \quad k=1,\ldots,K. \end{eqnarray*}\] Here \((\cdot)_+\) denotes the positive part: \[ (x_i - \xi_k)_+ = \begin{cases} x_i - \xi_k & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}. \]
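A sketch of the truncated-power construction for a linear spline (hypothetical knots; numpy only, names are ours):

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Truncated-power basis for a linear spline: columns x, (x - xi_k)_+."""
    cols = [x]
    for xi in knots:
        cols.append(np.clip(x - xi, 0.0, None))  # positive part (.)_+
    return np.column_stack(cols)

x = np.linspace(0, 10, 201)
B = linear_spline_basis(x, knots = [3.0, 7.0])
print(B.shape)  # (201, 3): b_1(x) = x plus one column per knot
```

Regressing \(y\) on these columns (plus an intercept) gives a piecewise linear fit that is automatically continuous at each knot.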

3.2 Cubic splines

  • A cubic spline with knots at \(\xi_k\), \(k = 1,\ldots,K\), is a piecewise cubic polynomial with continuous derivatives up to order 2 at each knot.

  • Again we can represent this model with truncated power basis functions \[ y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i, \] with \[\begin{eqnarray*} b_1(x_i) &=& x_i \\ b_2(x_i) &=& x_i^2 \\ b_3(x_i) &=& x_i^3 \\ b_{k+3}(x_i) &=& (x_i - \xi_k)_+^3, \quad k = 1,\ldots,K, \end{eqnarray*}\] where \[ (x_i - \xi_k)_+^3 = \begin{cases} (x_i - \xi_k)^3 & \text{if } x_i > \xi_k \\ 0 & \text{otherwise} \end{cases}. \]

  • A cubic spline with \(K\) knots costs \(K+4\) parameters or degrees of freedom. That is \(4(K+1)\) polynomial coefficients minus \(3K\) constraints.

  • While the truncated power basis is conceptually simple, it is not too attractive numerically: powers of large numbers can lead to severe rounding problems. In practice, B-spline basis functions are preferred for their computational efficiency. See ESL Chapter 5 Appendix.

Code
from sklearn.preprocessing import SplineTransformer

# Cubic spline for age
X_age = np.array(X['age']).reshape(-1, 1)
x_plot = np.linspace(start = 15, stop = 85, num = 70)
X_plot = x_plot[:, np.newaxis]
bs_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'continue',
    # include_bias = False
    ).fit(X_age).transform(X_plot)
    
ns_plot = SplineTransformer(
    degree = 3,
    # knots = np.array([25, 40, 60]).reshape(3, 1),
    n_knots = 5,
    extrapolation = 'linear',
    # include_bias = False
    ).fit(X_age).transform(X_plot)    
    
# Plot
fig, axes = plt.subplots(ncols = 2, figsize = (20, 6))
axes[0].plot(x_plot, bs_plot)
# axes[0].legend(axes[0].lines, [f"spline {n}" for n in range(4)])
axes[0].set_title("B-splines")

axes[1].plot(x_plot, ns_plot)
# axes[1].legend(axes[0].lines, [f"spline {n}" for n in range(8)])
axes[1].set_title("B-splines with linearity at boundary")
plt.show()

3.3 Natural cubic splines

  • Splines can have high variance at the outer range of the predictors.

  • A natural cubic spline extrapolates linearly beyond the boundary knots. This adds \(4 = 2 \times 2\) extra constraints, and allows us to put more internal knots for the same degrees of freedom as a regular cubic spline.

  • A natural spline with \(K\) knots has \(K\) degrees of freedom.

Code
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ bs(x, knots = c(25, 40, 60)),
    color = "blue"
    ) +
  # Natural cubic spline
  geom_smooth(
    method = "lm",
    formula = y ~ ns(x, knots = c(25, 40, 60)),
    color = "red"
    ) +  
  labs(
    title = "Natural cubic spline fit (red) vs cubic spline fit (blue)",
    x = "Age",
    y = "Wage (k$)"
    )

3.4 Knot placement

  • One strategy is to decide \(K\), the number of knots, and then place them at appropriate quantiles of the observed \(X\).

  • In practice, users often specify the degrees of freedom and let software choose the number of knots and their locations.
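The quantile strategy is one line in numpy (simulated predictor; this mirrors what R's bs(age, df = 6) does internally, as shown in the Lab):

```python
import numpy as np

# Hypothetical predictor: place K = 3 interior knots at its quartiles
rng = np.random.default_rng(3)
x = rng.normal(45, 12, 1000)
knots = np.quantile(x, [0.25, 0.50, 0.75])
print(knots)
```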

4 Smoothing splines

  • Consider this criterion for fitting a smooth function \(g(x)\) to some data: \[ \text{minimize} \quad \sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt. \]

    • The first term is RSS, and tries to make \(g(x)\) match the data at each \(x_i\).
    • The second term is a roughness penalty and controls how wiggly \(g(x)\) is. It is modulated by the tuning parameter \(\lambda \ge 0\).
      • The smaller \(\lambda\), the more wiggly the function, eventually interpolating \(y_i\) when \(\lambda = 0\).
      • As \(\lambda \to \infty\), the function \(g(x)\) becomes linear.
  • It can be shown that this problem has an explicit, finite-dimensional, unique minimizer which is a natural cubic spline with knots at the unique values of the \(x_i\).

  • The roughness penalty controls the roughness via \(\lambda\).

  • Smoothing splines avoid the knot-selection issue, leaving a single \(\lambda\) to be chosen.

  • The vector of \(n\) fitted values can be written as \(\hat{g}_\lambda = S_\lambda y\), where \(S_{\lambda}\) is an \(n \times n\) matrix (determined by the \(x_i\) and \(\lambda\)).

  • The effective degrees of freedom are given by \[ \text{df}_{\lambda} = \sum_{i=1}^n S_{\lambda,ii}. \] Thus we can specify df rather than \(\lambda\).

  • The leave-one-out (LOO) cross-validated error is given by \[ \text{RSS}_{\text{CV}}(\lambda) = \sum_{i=1}^n \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - S_{\lambda,ii}} \right]^2. \]
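Because \(\hat{g}_\lambda = S_\lambda y\) is a linear smoother, this shortcut avoids refitting \(n\) times. A sketch verifying the identity on the simplest linear smoother, the OLS hat matrix (simulated data; numpy only):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 1, n)
X = np.column_stack([np.ones(n), x, x**2])
y = np.sin(2 * np.pi * x) + rng.normal(scale = 0.2, size = n)

# Smoother matrix S = X (X'X)^{-1} X' (the hat matrix for OLS)
S = X @ np.linalg.solve(X.T @ X, X.T)
fitted = S @ y

# Shortcut: LOO residual_i = (y_i - fitted_i) / (1 - S_ii)
shortcut = (y - fitted) / (1 - np.diag(S))

# Brute force: refit leaving out each observation in turn
brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond = None)[0]
    brute[i] = y[i] - X[i] @ b

print(np.allclose(shortcut, brute))  # True
```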

The ggformula package supplies the geom_spline function for displaying smoothing spline fits.

Code
library(ggformula)
library(splines)

# Plot wage ~ age
Wage %>%
  ggplot(mapping = aes(x = age, y = wage)) + 
  geom_point(alpha = 0.25) + 
  # Smoothing spline with df = 16
  geom_spline(
      df = 16,
      color = "red"
    ) +
  # Smoothing spline with GCV tuned df
  geom_spline(
    # df = 6.8,
    cv = TRUE,
    color = "blue"
    ) +
  labs(
    title = "Smoothing spline with df=16 (red) vs LOOCV tuned df=6.8 (blue)",
    x = "Age",
    y = "Wage (k$)"
    )
Warning in smooth.spline(data$x, data$y, w = weight, spar = spar, cv = cv, :
cross-validation with non-unique 'x' values seems doubtful

  • Application to logistic regression \[ \log\frac{\mathbf{P}(Y=1|X)}{\mathbf{P}(Y=0|X)} = f(X), \] therefore \[ \mathbf{P}(Y=1|X) = \frac{\exp(f(X))}{1 + \exp(f(X))} = p(x). \]

  • The penalized log-likelihood criterion is \[ \ell(f,\lambda) = \sum_{i=1}^n \left[ y_i \log p(x_i) + (1-y_i) \log (1-p(x_i)) \right] - \lambda \int f''(t)^2 \, dt. \]

  • The solution is a natural cubic spline with knots at the unique values of the \(x_i\).

5 Local regression

  • Using a sliding weight function, we obtain separate linear fits over the range of \(X\) by weighted least squares.

  • At \(X=x_0\), \[ \text{minimize} \quad \sum_{i=1}^n K(x_i, x_0) (y_i - \beta_0 - \beta_1 x_i)^2, \] where \(K\) is a weighting function that assigns heavier weight for \(x_i\) close to \(x_0\) and zero weight for points furthest from \(x_0\).

  • Locally weighted linear regression: loess function in R and lowess in Python.

  • Anecdotally, loess gives better appearance, but is \(O(N^2)\) in memory, so does not work for larger data sets.

  • While implementation details (e.g., the weighting kernel) make some difference, the most important choice is the span \(s\), the proportion of points used to compute the local regression at each \(x_0\).

  • The span plays a role like that of the tuning parameter \(\lambda\) in smoothing splines: it controls the flexibility of the non-linear fit.

    • The smaller the value of \(s\), the more local and wiggly the fit; a very large value of \(s\) yields a global fit using all of the training observations.
    • Use cross-validation to choose \(s\), or just specify it directly.
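A rough from-scratch sketch of the weighted least squares step (tricube kernel, span-defined neighborhood; simulated data, all names ours):

```python
import numpy as np

def local_linear(x, y, x0, span = 0.5):
    """Local linear fit at x0: tricube weights on the span-nearest points."""
    n = len(x)
    k = max(2, int(np.ceil(span * n)))           # neighborhood size
    d = np.abs(x - x0)
    h = np.sort(d)[k - 1]                        # bandwidth = k-th smallest distance
    w = np.clip(1 - (d / h) ** 3, 0, None) ** 3  # tricube kernel K(x_i, x0)
    X = np.column_stack([np.ones(n), x - x0])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                               # fitted value at x0

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale = 0.2, size = 200)
fit = np.array([local_linear(x, y, x0, span = 0.2) for x0 in x])
print(np.mean((fit - np.sin(x)) ** 2))  # small: the fit tracks sin(x)
```

In practice one would call loess in R or statsmodels' lowess in Python rather than this sketch.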

6 Generalized additive model (GAM)

  • Generalized additive models (GAMs) allow for flexible nonlinearities in several variables, while retaining the additive structure of linear models: \[ y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p (x_{ip}) + \epsilon_i. \]

  • We can fit a GAM simply using, e.g., natural splines.

  • Coefficients not that interesting; fitted functions are.

  • Can mix terms: some linear, some nonlinear, and use ANOVA to compare models.

  • Can use smoothing splines or local regression as well. In R: gam(wage ~ s(year, df = 5) + lo(age, span = 0.5) + education).

  • GAMs are additive, although low-order interactions can be included in a natural way using, e.g. bivariate smoothers or interactions of the form (in R) ns(age, df = 5):ns(year, df = 5).

Natural splines for year and age.

gam_mod <- lm(
  wage ~ ns(year, df = 4) + ns(age, df = 5) + education,
  data = Wage
  )
summary(gam_mod)

Call:
lm(formula = wage ~ ns(year, df = 4) + ns(age, df = 5) + education, 
    data = Wage)

Residuals:
     Min       1Q   Median       3Q      Max 
-120.513  -19.608   -3.583   14.112  214.535 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   46.949      4.704   9.980  < 2e-16 ***
ns(year, df = 4)1              8.625      3.466   2.488  0.01289 *  
ns(year, df = 4)2              3.762      2.959   1.271  0.20369    
ns(year, df = 4)3              8.127      4.211   1.930  0.05375 .  
ns(year, df = 4)4              6.806      2.397   2.840  0.00455 ** 
ns(age, df = 5)1              45.170      4.193  10.771  < 2e-16 ***
ns(age, df = 5)2              38.450      5.076   7.575 4.78e-14 ***
ns(age, df = 5)3              34.239      4.383   7.813 7.69e-15 ***
ns(age, df = 5)4              48.678     10.572   4.605 4.31e-06 ***
ns(age, df = 5)5               6.557      8.367   0.784  0.43328    
education2. HS Grad           10.983      2.430   4.520 6.43e-06 ***
education3. Some College      23.473      2.562   9.163  < 2e-16 ***
education4. College Grad      38.314      2.547  15.042  < 2e-16 ***
education5. Advanced Degree   62.554      2.761  22.654  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 35.16 on 2986 degrees of freedom
Multiple R-squared:  0.293, Adjusted R-squared:  0.2899 
F-statistic:  95.2 on 13 and 2986 DF,  p-value: < 2.2e-16

Smoothing splines for year and age.

library(gam)

gam_mod <- gam(
  wage ~ s(year, 4) + s(age, 5) + education,
  data = Wage
  )
summary(gam_mod)

Call: gam(formula = wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
Deviance Residuals:
    Min      1Q  Median      3Q     Max 
-119.43  -19.70   -3.33   14.17  213.48 

(Dispersion Parameter for gaussian family taken to be 1235.69)

    Null Deviance: 5222086 on 2999 degrees of freedom
Residual Deviance: 3689770 on 2986 degrees of freedom
AIC: 29887.75 

Number of Local Scoring Iterations: NA 

Anova for Parametric Effects
             Df  Sum Sq Mean Sq F value    Pr(>F)    
s(year, 4)    1   27162   27162  21.981 2.877e-06 ***
s(age, 5)     1  195338  195338 158.081 < 2.2e-16 ***
education     4 1069726  267432 216.423 < 2.2e-16 ***
Residuals  2986 3689770    1236                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova for Nonparametric Effects
            Npar Df Npar F  Pr(F)    
(Intercept)                          
s(year, 4)        3  1.086 0.3537    
s(age, 5)         4 32.380 <2e-16 ***
education                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(gam_mod, se = TRUE, col = "red")

from sklearn.preprocessing import OneHotEncoder, SplineTransformer

# Natural cubic spline features of year predictor
ns_tf = make_column_transformer(
  (SplineTransformer(
    n_knots = 4,
    # knots = 'quantile',
    degree = 3, 
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
    ), ['year']),
  (SplineTransformer(
    n_knots = 5,
    # knots = 'quantile',
    degree = 3, 
    extrapolation = 'linear', # natural cubic spline
    # include_bias = False
    ), ['age']),
  (OneHotEncoder(drop = 'first'), ['education']),
  remainder = 'drop'
)

# Define pipeline and fit to Wage data
pipe = Pipeline(steps = [
  ("ns_tf", ns_tf),
  ("model", LinearRegression())
])

# Fit pipeline
X = Wage.drop('wage', axis = 1)
y = Wage.wage
pipe.fit(X, y)
# R^2
pipe.score(X, y)
from statsmodels.gam.api import GLMGam, BSplines

# Create spline basis for year and age
x_spline = Wage[['year', 'age']]
bs = BSplines(x_spline, df = [4, 5], degree = [3, 3])

# Fit GAM
gam_mod = GLMGam.from_formula('wage ~ education', data = Wage, smoother = bs).fit()
gam_mod.summary()
# Plot smooth components
for i in [0, 1]:
  plt.figure()
  gam_mod.plot_partial(i, cpr = True)
  plt.show()

7 Lab

7.1 Polynomial Regression and Step Functions

  • To predict wage using a fourth-degree polynomial in age: poly(age, 4) in lm
  • poly function creates orthogonal polynomials, which are uncorrelated and have mean zero. Essentially it means that each column is a linear combination of the variables age, age^2, age^3 and age^4.
  • We can also use poly() to obtain age, age^2, age^3 and age^4 directly, if we prefer, by passing the raw = TRUE argument to the poly() function.
attach(Wage)
fit <- lm(wage ~ poly(age, 4), data = Wage)
coef(summary(fit))
                Estimate Std. Error    t value     Pr(>|t|)
(Intercept)    111.70361  0.7287409 153.283015 0.000000e+00
poly(age, 4)1  447.06785 39.9147851  11.200558 1.484604e-28
poly(age, 4)2 -478.31581 39.9147851 -11.983424 2.355831e-32
poly(age, 4)3  125.52169 39.9147851   3.144742 1.678622e-03
poly(age, 4)4  -77.91118 39.9147851  -1.951938 5.103865e-02
fit2 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
coef(summary(fit2))
                            Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)            -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
poly(age, 4, raw = T)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
poly(age, 4, raw = T)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
poly(age, 4, raw = T)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
poly(age, 4, raw = T)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498
# Equivalent to
fit3 <- lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = Wage)
  • In performing a polynomial regression we must decide on the degree of the polynomial to use. One way to do this is by using hypothesis tests.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage) 
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage) 
fit.5 <- lm(wage ~ poly(age, 5), data = Wage) 
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
Analysis of Variance Table

Model 1: wage ~ age
Model 2: wage ~ poly(age, 2)
Model 3: wage ~ poly(age, 3)
Model 4: wage ~ poly(age, 4)
Model 5: wage ~ poly(age, 5)
  Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
1   2998 5022216                                    
2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
3   2996 4777674  1     15756   9.8888  0.001679 ** 
4   2995 4771604  1      6070   3.8098  0.051046 .  
5   2994 4770322  1      1283   0.8050  0.369682    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • As an alternative to using hypothesis tests and ANOVA, we could choose the polynomial degree using cross-validation, as discussed in Chapter 5.

  • We can also fit a polynomial logistic regression, e.g., to predict whether an individual earns more than $250,000 per year, by supplying family = binomial to the glm() function.

fit <- glm(I(wage > 250) ~ poly(age, 4), data = Wage, family = binomial)
  • Once again, we make predictions using the predict() function.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit, newdata = list(age = age.grid), se = T)
preds <- predict(fit, newdata = list(age = age.grid), type = "response", se = T)
  • In order to fit a step function, as discussed in Section 7.2, we use the cut() function.
fit <- lm(wage ~ cut(age, 4), data = Wage)
summary(fit)

Call:
lm(formula = wage ~ cut(age, 4), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.126 -24.803  -6.177  16.493 200.519 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)              94.158      1.476  63.790   <2e-16 ***
cut(age, 4)(33.5,49]     24.053      1.829  13.148   <2e-16 ***
cut(age, 4)(49,64.5]     23.665      2.068  11.443   <2e-16 ***
cut(age, 4)(64.5,80.1]    7.641      4.987   1.532    0.126    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 40.42 on 2996 degrees of freedom
Multiple R-squared:  0.0625,    Adjusted R-squared:  0.06156 
F-statistic: 66.58 on 3 and 2996 DF,  p-value: < 2.2e-16
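A Python counterpart of the cut() step-function fit can be sketched with pandas (we simulate a stand-in for Wage here so the snippet is self-contained):

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the Wage data (the real analysis uses ISLR's Wage)
rng = np.random.default_rng(6)
age = rng.integers(18, 80, 500)
wage = 80 + 0.5 * age + rng.normal(scale = 20, size = 500)

# cut() analogue: bin age into 4 equal-width intervals, then one-hot encode
bins = pd.cut(age, 4)
X = pd.get_dummies(bins, drop_first = True).astype(float)
X.insert(0, "intercept", 1.0)

# Step function fit = OLS on the bin indicators
beta = np.linalg.lstsq(X.to_numpy(), wage, rcond = None)[0]
print(len(beta))  # intercept + 3 bin contrasts = 4
```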

7.2 Splines

  • In order to fit regression splines in R, we use the splines library.
  • The bs() function generates the entire matrix of basis functions for splines with the specified set of knots.
library(splines)
fit <- lm(wage ~ bs(age, knots = c(25, 40, 60)), data = Wage)
  • The df option to bs() can be used to produce a spline with a specified number of degrees of freedom.
dim(bs(age, knots = c(25, 40, 60))) 
[1] 3000    6
dim(bs(age, df = 6))
[1] 3000    6
attr(bs(age, df = 6), "knots") 
[1] 33.75 42.00 51.00
  • In order to instead fit a natural spline, we use the ns() function. Here we fit a natural spline with four degrees of freedom.
fit <- lm(wage ~ ns(age, df = 4), data = Wage)
summary(fit)

Call:
lm(formula = wage ~ ns(age, df = 4), data = Wage)

Residuals:
    Min      1Q  Median      3Q     Max 
-98.737 -24.477  -5.083  15.371 204.874 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        58.556      4.235  13.827   <2e-16 ***
ns(age, df = 4)1   60.462      4.190  14.430   <2e-16 ***
ns(age, df = 4)2   41.963      4.372   9.597   <2e-16 ***
ns(age, df = 4)3   97.020     10.386   9.341   <2e-16 ***
ns(age, df = 4)4    9.773      8.657   1.129    0.259    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 39.92 on 2995 degrees of freedom
Multiple R-squared:  0.08598,   Adjusted R-squared:  0.08476 
F-statistic: 70.43 on 4 and 2995 DF,  p-value: < 2.2e-16
  • In order to fit a smoothing spline, we use the smooth.spline() function.
fit <- smooth.spline(age, wage, df = 16)
fit2 <- smooth.spline(age, wage, cv = TRUE)
Warning in smooth.spline(age, wage, cv = TRUE): cross-validation with
non-unique 'x' values seems doubtful
  • local regression
attach(Wage)
plot(age, wage, xlim = agelims, cex = .5, col = "darkgrey")
title("Local Regression")
fit <- loess(wage ~ age, span = .2, data = Wage)
fit2 <- loess(wage ~ age, span = .5, data = Wage)
lines(age.grid, predict(fit, data.frame(age = age.grid)), col = "red", lwd = 2)
lines(age.grid, predict(fit2, data.frame(age = age.grid)), col = "blue", lwd = 2)
legend("topright", legend = c("Span = 0.2", "Span = 0.5"), col = c("red", "blue"), lty = 1, lwd = 2, cex = .8)

7.3 GAMs

  • In order to fit a GAM in R, we use the gam function, which is part of the gam library.
  • The s() function, which is part of the gam library, is used to indicate that we would like to use a smoothing spline.
library(gam) 
gam.m3 <- gam(wage ~ s(year, 4) + s(age, 5) + education, data = Wage)
gam.m1 <- gam(wage ~ s(age, 5) + education, data = Wage) 
gam.m2 <- gam(wage ~ year + s(age, 5) + education, data = Wage)
anova(gam.m1, gam.m2, gam.m3, test = "F")
Analysis of Deviance Table

Model 1: wage ~ s(age, 5) + education
Model 2: wage ~ year + s(age, 5) + education
Model 3: wage ~ s(year, 4) + s(age, 5) + education
  Resid. Df Resid. Dev Df Deviance       F    Pr(>F)    
1      2990    3711731                                  
2      2989    3693842  1  17889.2 14.4771 0.0001447 ***
3      2986    3689770  3   4071.1  1.0982 0.3485661    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We can also use local regression fits as building blocks in a GAM, using the lo() function.
gam.lo <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, :
liv too small.  (Discovered by lowesd)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv
too small.  (Discovered by lowesd)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, :
liv too small.  (Discovered by lowesd)
Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv
too small.  (Discovered by lowesd)
plot(gam.lo, se = TRUE, col = "green")